May 3, 2016

Random graph stuff

Preliminaries and notation

  • A graph \(G\) is a mathematical object that consists of a set of vertices (nodes), \(V(G)\) and a set of edges (links), \(E(G)\). Let \(e_{ij} \in E(G)\) denote the edge between vertices \(i, j \in V(G)\).
  • Throughout, \(|V(G)|\) is the number of vertices in \(G\) and \(|E(G)|\) the number of edges.
  • The degree of vertex \(i \in V(G)\), \(d_i\) is the number of edges that are connected to vertex \(i\).
  • We consider undirected (\(e_{ij} = e_{ji}\)) and loopless (\(i \not= j \forall e_{ij} \in E(G)\)) graphs.
  • The distance \(dist(i, j)\) is the minimum number of edges in a connected path from \(i\) to \(j\) in \(G\).
  • The \(d\)th order neighborhood of a vertex \(i \in V(G)\) is defined as \(N_d(u, G) = \{j \in V(G): dist(i, j) \le d\}\)

Neighborhoods

d =

(Martin 2011, Cukier (2012))

Network inference

A complete description of a network and its topology is infeasible \(\Rightarrow\) we can study the statistics of a complex network (i.e., clustering coeffients, numbers of triangles, average degree, etc.)

Let

  • \(G\) be a hypothetical undirected random graph,
  • \(F = \{f(k), k \ge 0\}\) the degree distribution of \(G\),
  • \(G_n\) be an observed realization of \(G\) of order \(n\) s.t. as \(n \rightarrow \infty\), the joint degree distributions of \(G_n\) approach the joint degree distribution of \(G\).
  • \(d_1, \dots, d_n\) be the observed degrees of the vertices in \(G_n\).

Goal: Estimate some statistic \(T_n\) of \(G\) using a block bootstrap.

A snowball's chance…

Overview

  • "Using the bootstrap for statistical inference on random graphs", Thompson, et. al. (2016); snowboot package on CRAN
  • Estimation of mean degree of \(G\), \(T_n = \mu(G)\), from a sample of vertices using a patchwork sampling method - "labelled snowball sampling with multiple inclusions (LSMI)", from survey sampling
  • seeds \(=\) initial random sample of single vertices
    patch \(=\) formed by following paths of connected vertices emanating from the seed up to a certain distance
    wave \(=\) increments in distance
  • The degrees of all vertices visited in a patch are observed
  • bootstrap seeds (randomly sample seeds with replacement and estimate the mean degree) then non-seeds (non-weighted or weighted selection proportional to reciprocals of their degrees); repeat \(T \times B\) bootstrap replications.

Sampling Algorithm

Data: graph \(G_n\); number of seeds \(m\), \(m < n\); number of waves \(d\)
Result: sample of \(m\) seeds with \(d\) waves around each seed.
seed \(=\) randomly sample without replacement \(m\) vertices from \(G_n\)
for \(q = 1, \dots, m\) do
    start with original \(G_n\) and \(seed_q\)
    for \(w = 1, \dots, d\) do
        \(wave_w =\) select all immediate neighbors using the existing edges
        remove the used edges
    end
    \(patch_q =\) join the current \(seed_q\) and all \(wave_w\) keeping the repeated elements
end
join all \(m\) patches, keeping all of the repeated elements

Additionally, sample non-seeds according to non-weighted or weighted selection.

Repeat \(B\) times and calculate the estimate, \(T_n^*\) for each replicate.

Note: The authors recommend running this procedure \(T\) times to get \(T\) bootstrap distributions because high variability among LSMIs

Estimator

\[T_n^* = \frac{\dfrac{1}{m}\sum_{q=1}^{m}d_q + D(1-\hat{p}_0)\dfrac{1}{m}\sum_{q=1}^m \tilde{A}_{NSq}}{\dfrac{1}{m}\sum_{q=1}^m 1 + D\dfrac{1}{m}\sum_{q=1}^m \tilde{B}_{NSq}}\]
Where

  • \(\hat{p}_0\) is the relative frequency of vertices of degree zero among the seeds,
  • \(D\) is an arbitrary conbination factor depending on the degree distribution, for simplicity set \(D =\) the average degree of seeds.
  • \(\tilde{A}_{NSq}\) and \(\tilde{B}_{NSq}\) are the wave-combined Horwitz-Thompson estimators of the sum of the degrees of all vertices in \(G_n\) and the number of vertices in \(G_n\) having positive degree, respectively.
    \[\tilde{A}_{NSq} = \sum\limits_{w = 1}^d \sum\limits_{j \in \text{ wave } w \text{ of } q} 1 \text{ and } \tilde{B}_{NSq} = \sum\limits_{w = 1}^d \sum\limits_{j \in \text{ wave } w \text{ of } q} \frac{1}{d_j}\]

Properties

Conditions:

  1. The size of a \(d\)-wave patch has its third absolute moment bounded above in probability.
  2. The variances of both the size of a \(d\)-wave patch and the number of non-seeds per first neighbor for a \(d-\)wave patch are bounded below in probability.
  3. The expectaion of the excess degree of each neighbor of a random seed is independent of the degree of the seed (to order \(o_p(1)\))

Theorem:

  1. Suppose that the degree distribution of \(G\) has a positive variance and a finite fourth moment. Also suppose that under the mechanism generating a degree sequence and \(G_n\), C1 and C2 are satisfied. Then the LSMI estimator is asymptotically normal as \(n\) and \(m\) tend to \(\infty\) with \(m \le n/2\).
  2. If either \(d = 1\) or \(d > 1\) and C3 holds, then the LSMI estimator is consistent for the mean degree of G.

This is complicated


  • Variability among LSMIs is high - due to sampling scheme being variable?
  • How does the variance of the estimator compare to the expected variance? (Does this even really work?)
  • How do we reconcile these \(T\) bootstrap distributions?
  • Tuning - have to choose a seed-wave (\(w\) and \(d\)) combination, choice of two tuning parameters, which may also have sensitivy to interaction effects.
  • Extensibility - how hard is it to extend this to other statistics?

Here's an idea…

Something simpler

Let's try to estimate \(T_n =\) the # of edges in \(G\) \(= e\) to start.

Algorithm

Data: graph \(G_n\); average block size, \(m = \frac{1}{n}\sum\limits_{i =1}^n \{d_i + 1\}\); number of blocks, \(k = \lfloor\frac{n}{m}\rfloor\); intra-block connection probability, \(\hat{p}\).
Result: \(B\) estimates of \(T_n\)
for \(i = 1, \dots, B\) do
    start with original \(G_n\), select \(k\) indices \(j \in 1, \dots, n\) with replacement.
    for \(j = 1, \dots, k\) do
        \(B_j =\) select all vertices \(k \in N_1(j)\) and vertex \(j\), keeping all the existing edges
    end
    join the disconnected subgraphs (\(B^*_1, \dots, B^*_k\)) via iid \(Bern(\hat{p})\) draws. The result is \(G^*_i\).
    Calculate \(T_n^*\) for \(G^*_i\).
end

\[T_n^* = |E(G^*_i)| = e^*_i\]

Properties

For \(T_n = e\), if \(\hat{p} = \frac{2(n-2k)}{k(k-1)m^2n} e_n\) where \(e_n = |E(G_n)|\) then

  • \(\mathbb{E}_*\left(T_n^*|G_n\right) = e_n\).
  • \(\text{Var}_*\left(T_n^*|G_n\right) = \frac{k}{n}\sum\limits_{i = 1}^n(|E(B_i)|^2) - \frac{4k}{n^2} e_n^2 - \frac{4(n-2k)^2}{k(k-1)m^2n^2}e_n^2 + \frac{2(n-2k)}{n}e_n\).

See Appendix for details.

Simulated examples

Simulated Erdos-Renyi graphs of size \(n = 10, 50, 100, \dots, 500\) and used the "simple" block scheme to estimate the distribution of \(T_n = e_n\).

Adjusted statistic

Try instead, \(S_n^* = (T_n^* - \mathbb{E}_*(T_n^*))/\sqrt{\text{Var}_*(T_n^*)}\)

Thoughts

  • Could extend this procedure using a tuning parameter, \(d\), the distance to define a block.
    • Currently, \(d = 1\). Would need to adjust \(\hat{p}\) and \(m\) accordingly.
    • This could be useful for different statistics \(T_n\) that we are interested in (like number of triangles, mean degree, etc.)
  • Need to look at examples with different graph generation mechanisms
  • Variance properties/asymptotics/limiting distribution unclear
  • Compare to LSMI (would need corresponding \(\hat{p}\) for unbiased estimate of \(\mu(G)\))

Appendix

Properties details

For \(\hat{p} = \frac{2(n-2k)}{k(k-1)m^2n} e_n\), where \(e_w^*\) denotes the edges within blocks in the bootstrap sample and \(e_b^*\) denotes the edges between blocks in the bootstrap sample.

\[ \begin{aligned} \mathbb{E}_*\left(T_n^*|G_n\right) &= \mathbb{E}_*\left(e^*|G_n\right) \\ &= \mathbb{E}_*\left(e_w^* + e_b^*|G_n\right) \\ &= \sum\limits_{i = 1}^k \mathbb{E}_*\left(|E(B^*_i)|\right) + \sum\limits_{i = 1}^{k-1}\sum\limits_{j = i + 1}^k\mathbb{E}_*\left(\text{ # edges between } B^*_i \text{ and } B^*_j\right) \\ &= \frac{k}{n} \sum\limits_{i = 1}^n|E(B_i)| + \frac{k(k-1)}{2} \left(\frac{1}{n}\sum\limits_{i = 1}^n |V(B_i)|\right)^2 \hat{p} \\ &= \frac{2k}{n} e_n + \frac{k(k-1)}{2}m^2 \frac{2(n-2k)}{k(k-1)m^2n} e_n \\ &= e_n \end{aligned} \]

Provided \(k > 1\), and \(\hat{p} < 1\). A sufficient (but not necessary condition for \(\hat{p} < 1\), would be \(n < k(k-1)m^2 + 2k \approx n(n-m) + 2n/m\). In other words, \(n - m + 2/m > 1\) is approximately sufficient for \(\hat{p} < 1\), and is a very mild condition.

Properties details (Cont'd)

\[ \begin{aligned} \text{Var}_*(T_n^*|G_n) &= \text{Var}_*(e_w^* + e_b^*|G_n) \\ &= \text{Var}_*(e_w^*|G_n) + \text{Var}_*(e_b^*|G_n) + \text{Cov}_*(e_w^*, e_b^*|G_n) \\ ~\\ \text{Var}_*(e_w^*|G_n) &= \text{Var}_*\left(\sum\limits_{i = 1}^k |E(B_i^*)||G_n\right) \\ &= \sum\limits_{i = 1}^k \sum\limits_{j = 1}^k\text{Cov}_*(|E(B_i^*)|, |E(B_j^*)||G_n) \\ &= \sum\limits_{i = 1}^k\text{Var}_*( |E(B_i^*)||G_n) \\ &= k\mathbb{E}_*((|E(B_i^*)| - \mathbb{E}_*(|E(B_i^*)||G_n))^2|G_n) \\ &= \frac{k}{n}\sum\limits_{i = 1}^n\left(|E(B_i^*)| - \frac{2}{n}e_n\right)^2 \\ &= \frac{k}{n}\sum\limits_{i = 1}^n|E(B_i^*)|^2 - \frac{4k}{n^2}e_n^2 \end{aligned} \]

Properties details (Cont'd)

\[ \begin{aligned} \text{Var}_*(e_b^*|G_n) &= \text{Var}_*\left(\sum\limits_{i = 1}^{k-1} \sum\limits_{j = i + 1}^{k}(\text{ # edges between } B_i^* \text{ and } B_j^*)|G_n\right) \\ &= \sum\limits_{i = 1}^{k-1} \sum\limits_{j = i + 1}^{k}\sum\limits_{l = 1}^{k-1} \sum\limits_{m = i + 1}^{k}\text{Cov}_*\left(\text{ # edges btw } B_i^* \text{ and } B_j^*, \text{ # edges btw } B_l^* \text{ and } B_m^*|G_n\right) \\ &= 2 \sum\limits_{i = 1}^{k-1} \sum\limits_{j = i + 1}^{k}\text{Var}_*\left(\text{ # edges between } B_i^* \text{ and } B_j^*|G_n\right) \\ &= k(k-1)m^2\hat{p}(1-\hat{p}) \\ &= \frac{2(n-2k)}{n}e_n - \frac{4(n-2k)^2}{k(k-1)m^2n^2}e_n^2 \\ ~\\ ~\\ \text{Cov}_*(e_w^*, e_b^*|G_n) &= 0 \end{aligned} \]

References

Cukier, Jerome. 2012. “Events in the Game of Thrones.” December. http://www.jeromecukier.net/projects/agot/events.html.

Martin, George RR. 2011. A Dance with Dragons. New York, NY, USA: Bantam Books.

Thompson, Mary E., Lilia L. Ramirez Ramirez, Vyacheslav Lyubchich, and Yulia R. Gel. 2016. “Using the Bootstrap for Statistical Inference on Random Graphs.” Canadian Journal of Statistics 44 (1): 3–24.